Abstract

We consider a Zipf-Poisson ensemble in which $X_i \sim \mathrm{Poi}(N i^{-\alpha})$ for
$\alpha > 1$ and $N > 0$ and integers $i \ge 1$. As $N \to \infty$ the first $n'(N)$ random
variables have their proper order $X_1 > X_2 > \cdots > X_{n'}$ relative to each
other, with probability tending to 1 for $n'$ up to $(AN/\log(N))^{1/(\alpha+2)}$ for
an explicit constant $A(\alpha) > 3/4$. The rate $N^{1/(\alpha+2)}$ cannot be achieved.
The ordering of the first $n'(N)$ entities does not preclude $X_m > X_{n'}$ for
some interloping $m > n'$. The first $n''$ random variables are correctly
ordered exclusive of any interlopers, with probability tending to 1 if $n'' \le
(BN/\log(N))^{1/(\alpha+2)}$ for $B < A$. For a Zipf-Poisson model of the British
National Corpus, which has a total word count of 100,000,000, our result
estimates that the 72 words with the highest counts are properly ordered.



1 Introduction 



Power law distributions are ubiquitous, arising in studies of degree distributions 
of large networks, book and music sales counts, frequencies of words in literature 
and even baby names. It is common that the relative frequency of the i'th most
popular term falls off roughly as $i^{-\alpha}$ for a constant $\alpha$ slightly larger than 1. This
is the pattern made famous by Zipf (1949). Data usually show some deviations
from a pure Zipf model. The Zipf-Mandelbrot law, for which the i'th frequency
is proportional to $(i + k)^{-\alpha}$ where $k > 0$, is often a much better fit. That and
many other models are described in Popescu (2009, Chapter 9).



The usual methods for fitting long tailed distributions assume an IID sample. 
However, in many applications a persistent set of entities is repeatedly sampled 
under slightly different conditions. For example, if one gathers a large sample of 
English text, the word 'the' will be the most frequent word with overwhelming 
probability. No other word has such high probability and repeated sampling 
will not give either zero or two instances of such a very popular word. Similarly 
in Internet applications, the most popular URLs in one sample are likely to 
reappear in a similar sample, taken shortly thereafter or in a closely related 
stratum of users. The movies most likely to be rated at Netflix in one week will, 
with only a few changes, be the same popular movies the next week. 



Because the entities themselves have a meaning beyond our sample data, it 
is natural to wonder whether they are in the correct order in our sample. The 
problem we address here is the error in ranks estimated from count data. By 
focussing on count data we are excluding other long tailed data such as the 
lengths of rivers. 

In a large data set, the top few most popular items are likely to be correctly 
identified while the items that appeared only a handful of times cannot be 
confidently ordered from the sample. We are interested in drawing the line 
between ranks that are well identified and those that may be subject to sampling 
fluctuations. One of our motivating applications is a graphical display in Dyer
and Owen (2010). Using that display one is able to depict a head to tail affinity
for movie ratings data: the busiest raters are over represented in the most 
obscure movies and conversely the rare raters are over represented in ratings of 
very frequently rated movies. Both effects are concentrated in a small corner 
of the diagram. The graphic has greater importance if it applies to a source 
generating the data than if it applies only to the data at hand. 

This paper uses the Zipf law because it is the simplest model for long tailed
rank data and we can use it to get precisely stated asymptotic results. If
instead we are given another distribution, then a numerical method described in
Section 4 is very convenient to apply.

If we suppose that the item counts are independent and Poisson distributed 
with expectations that follow a power law, then a precise answer is possible. 
We define the Zipf-Poisson ensemble to be an infinite collection of independent
random variables $X_i \sim \mathrm{Poi}(\lambda_i)$ where $\lambda_i = N i^{-\alpha}$ for parameters $\alpha > 1$ and
$N > 0$. Our main results are summarized in Theorem 1 below.

Theorem 1. Let $X_i$ be sampled from the Zipf-Poisson ensemble with parameter
$\alpha > 1$. If $n = n(N) \le (AN/\log(N))^{1/(\alpha+2)}$ for $A = \alpha^2(\alpha+2)/4$, then

$$\lim_{N\to\infty} P(X_1 > X_2 > \cdots > X_n) = 1. \tag{1}$$

If $n = n(N) \le (BN/\log(N))^{1/(\alpha+2)}$ for $B < A$, then

$$\lim_{N\to\infty} P\Bigl(X_1 > X_2 > \cdots > X_n > \max_{i>n} X_i\Bigr) = 1. \tag{2}$$

If $n = n(N) \ge CN^{1/(\alpha+2)}$ for any $C > 0$, then

$$\lim_{N\to\infty} P(X_1 > X_2 > \cdots > X_n) = 0. \tag{3}$$

Equation (1) states that the top $n' = \lfloor (AN/\log(N))^{1/(\alpha+2)} \rfloor$ entities, with
$A = \alpha^2(\alpha+2)/4$, are correctly ordered among themselves with probability
tending to 1 as $N \to \infty$. From $\alpha > 1$ we have $A > 3/4$. Equation (3) shows that
we cannot remove $\log(N)$ from the denominator, because the first $CN^{1/(\alpha+2)}$
entities will fail to have the correct joint ordering with a probability approaching
1 as $N \to \infty$.

Equation (1) leaves open the possibility that some entity beyond the $n'$th
manages to get among the top $n'$ entities due to sampling fluctuations. Those
entities each have only a small chance to be bigger than $X_{n'}$, but there are
infinitely many of them. Equation (2) shows that with probability tending to
1, the first $n'' = \lfloor (BN/\log(N))^{1/(\alpha+2)} \rfloor$ entities are the correct first $n''$ entities
in the correct order. The limit holds for any $B < A$. That is, there is very little
scope for interlopers.

Section 2 shows an example based on 100,000,000 words of the British
National Corpus (BNC). See Aston and Burnard (1998). Using $\alpha$ near 1.1 in the
asymptotic formulas, we estimate that the first 72 words are correctly ordered
among themselves. In a Monte Carlo simulation, very few interloping counts
were seen. The estimate $n' = 72$ depends on the Zipf-Poisson assumption which
is an idealization, but it is quite stable if the log-log relationship is locally linear
in a critical region of $n$ values.

Section 3 proves our results. Of independent interest there is Lemma 1, which
gives a Chernoff bound for the Skellam (1946) distribution: for $\lambda \ge \nu > 0$ we
show that $P(\mathrm{Poi}(\lambda) \le \mathrm{Poi}(\nu)) \le \exp(-(\sqrt{\lambda} - \sqrt{\nu})^2)$ where $\mathrm{Poi}(\lambda)$ and $\mathrm{Poi}(\nu)$
are independent Poisson random variables with the given means. Section 4 has
our conclusions.



2 Example: the British National Corpus 

Figure 1 plots the frequency of English words versus their rank on a log-log scale,
for all words appearing at least 800 times among the approximately 100 million
words of the BNC. The counts are from Kilgarriff (2006). The data plotted have
a nearly linear trend with a slope just steeper than -1. They are not perfectly
Zipf-like, but the fit is extraordinarily good considering that it uses just one
parameter for 100 million total words.

The top 10 counts from Figure 1 are shown in Table 1. The most frequent
word 'the' is much more frequent than the second most frequent word 'be'. The
process generating this data clearly favors the word 'the' over 'be', and a p-value
for whether these words might be equally frequent, using Poisson assumptions,
is overwhelmingly significant. Though the 9'th and 10'th words have counts
that are within a few percent of each other, they too are significantly different,
as judged by $(X_9 - X_{10})/\sqrt{X_9 + X_{10}} = 34.9$, the number of estimated standard
deviations separating them. The 500'th and 501'st most popular words are
'report' and 'pass' with counts of 20,660 and 20,633 respectively. These are not
significantly different.
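
The separation statistic used above can be checked with a few lines of Python. This is only a sketch: the helper name `poisson_z` is ours, and the counts are the ones quoted in the text and in Table 1.

```python
import math

def poisson_z(count_a, count_b):
    """Estimated standard deviations separating two independent Poisson counts."""
    return (count_a - count_b) / math.sqrt(count_a + count_b)

# Counts quoted in the text: ranks 9 and 10, then ranks 500 and 501.
z_9_10 = poisson_z(1_090_186, 1_039_323)
z_500_501 = poisson_z(20_660, 20_633)

print(f"ranks 9 vs 10:    z = {z_9_10:.1f}")
print(f"ranks 500 vs 501: z = {z_500_501:.2f}")
```

The first value reproduces the 34.9 reported in the text, while the second is well below one standard deviation.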

We will use a value of $\alpha$ close to 1.1 to illustrate the results of Theorem 1.
The data appear to have approximately this slope in what will turn out to be
the important region, with ranks from 10 to 100. We don't know $N$ but we can
estimate it. Let $T = \sum_{i=1}^{\infty} X_i$ be the total count. Then $E(T) = \sum_{i=1}^{\infty} N i^{-\alpha} =
N\zeta(\alpha)$ where $\zeta(\cdot)$ is the Riemann zeta function. We find that $\zeta(\alpha_*) = 10$ for
$\alpha_* = 1.106$. Choosing $\alpha = \alpha_*$ we find that $T = 10^8$ corresponds to $N = N_* =
10^7$.
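
As a rough numerical check of this calibration, the sketch below evaluates the Riemann zeta function in plain Python and solves $E(T) = N\zeta(\alpha)$ for $N$. The function `zeta` and its Euler-Maclaurin tail correction are our own choices for illustration, not the authors' method.

```python
import math

def zeta(s, m=1000):
    """Riemann zeta for s > 1: direct sum of the first m-1 terms plus an
    Euler-Maclaurin estimate of the tail that starts at m."""
    head = sum(i ** -s for i in range(1, m))
    tail = m ** (1 - s) / (s - 1) + 0.5 * m ** -s + s * m ** (-s - 1) / 12
    return head + tail

alpha_star = 1.106
T = 1e8                  # total word count of the corpus, from the text
z = zeta(alpha_star)
N_hat = T / z            # solves E(T) = N * zeta(alpha) for N

print(f"zeta({alpha_star}) = {z:.3f}")
print(f"N estimate = {N_hat:.3e}")
```

The computed zeta value is close to, though not exactly, 10, so the resulting $N$ is close to $10^7$.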

Theorem 1 has the top $n' = (A(\alpha)N/\log(N))^{1/(\alpha+2)}$ entities correctly
ordered among themselves with probability tending to 1. For the BNC data we
get $n' = (A(\alpha_*)N_*/\log(N_*))^{1/(\alpha_*+2)} = 72.08$. For data like this, we could
reasonably expect the true 72 most popular words to be correctly ordered among
themselves.

Figure 1: Zipf plot for the British National Corpus data. The reference line
above the data has slope -1, while that below the data has slope -1.1.

Rank   Word        Count
   1   the     6,187,267
   2   be      4,239,632
   3   of      3,093,444
   4   and     2,687,863
   5   a       2,186,369
   6   in      1,924,315
   7   to      1,620,850
   8   have    1,375,636
   9   it      1,090,186
  10   to      1,039,323

Table 1: The top ten most frequent words from the British National Corpus,
with their frequencies. Item 7 is the word 'to', used as an infinitive marker,
while item 10 is 'to' used as a preposition. In "I went to the library to read."
the first 'to' is a preposition and the second is an infinitive marker.

Figure 2: The Zipf-Poisson ensemble with $N = 10^7$ and $\alpha = 1.106$ was
simulated 1000 times. The histogram shows the distribution of the number of
correctly ordered words.
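
The threshold quoted above can be reproduced directly from the formula of Theorem 1. The sketch below uses helper names of our own (`A_const`, `n_prime`) and evaluates the threshold for the two values of $N$ discussed in this section.

```python
import math

def A_const(alpha):
    """The constant A(alpha) = alpha^2 (alpha + 2) / 4 from Theorem 1."""
    return alpha ** 2 * (alpha + 2) / 4

def n_prime(alpha, N):
    """Threshold (A N / log N)^(1 / (alpha + 2)), before taking the floor."""
    return (A_const(alpha) * N / math.log(N)) ** (1 / (alpha + 2))

print(f"N = 1.00e7: n' = {n_prime(1.106, 1.00e7):.2f}")   # text reports 72.08
print(f"N = 1.25e7: n' = {n_prime(1.106, 1.25e7):.2f}")   # text reports 77.10
```

Because the exponent $1/(\alpha+2)$ is roughly one third, a 25% change in $N$ moves the threshold by only a few ranks.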

We did a small simulation of the Zipf-Poisson model. The results are shown
in Figure 2. The number of correctly ordered items ranged from 69 to 153
in those 1000 simulations. The number was only smaller than 72 for 2 of the
simulated cases.

In our simulation, the first rank error to occur was usually a transposition
between the $n$'th and $(n+1)$'st entity. This happened 982 times. There were 7
cases with a tie between the $n$'th and $(n+1)$'st entity. The remaining 11 cases
all involved the $(n+2)$'nd entity getting ahead of the $n$'th. As a result, we see
that interlopers are very rare, as we might expect from Lemma 4.
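
A simulation of this kind can be sketched with the standard library alone. The version below is a scaled-down illustration, not the authors' code: it uses a small $N$ so that Knuth's simple Poisson sampler is adequate, truncates the ensemble to a finite number of items (so interlopers from beyond the truncation point are ignored), and counts how many leading items are in strictly decreasing order. All function names and parameter values are our assumptions.

```python
import math
import random

def poisson_knuth(lam, rng):
    """Knuth's multiplication sampler; suitable only for modest means."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def leading_correct(counts):
    """Largest n with counts[0] > counts[1] > ... > counts[n-1]."""
    if not counts:
        return 0
    n = 1
    while n < len(counts) and counts[n - 1] > counts[n]:
        n += 1
    return n

def simulate(N, alpha, n_items, reps, seed=0):
    """Number of correctly ordered leading items in each of `reps` draws."""
    rng = random.Random(seed)
    lams = [N * i ** -alpha for i in range(1, n_items + 1)]
    return [
        leading_correct([poisson_knuth(lam, rng) for lam in lams])
        for _ in range(reps)
    ]

# Small illustrative ensemble (hypothetical parameters, far below N = 10^7).
runs = simulate(N=500.0, alpha=1.106, n_items=60, reps=200)
print("min, max correctly ordered:", min(runs), max(runs))
```

Reproducing the scale of the experiment in the text would need a Poisson sampler that handles means in the millions, which this simple sampler does not.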

Lack of fit of the Zipf-Poisson model will affect the estimate somewhat. Here
we give a simple analysis to show that the estimate $n' = 72$ is remarkably stable.
Even though the log-log plot in Figure 1 starts off shallower than slope -1, the
first 10 counts are so large that we can be confident that they are correctly
ordered. Similarly, the log-log plot ends up somewhat steeper than -1.1, but
that takes place for very rare items that have negligible chance of ending up
ahead of the 72'nd word. As a result we choose to work with $\alpha = 1.106$ and
re-estimate $N$ to match the counts in the range $10 \le n \le 100$. Those counts
are large and very stable. A simple estimate of $N$ is $N_i = X_i i^{\alpha}$. For this data
$\min_{10 \le i \le 100} N_i = 1.25 \times 10^7$. Using $N = 1.25 \times 10^7$ with $\alpha = 1.106$ and $B = 1$
gives $n' = 77.10$, raising the estimate only slightly from 72. The value $\alpha_* = 1.106$
was chosen partly based on fit and partly based on the numerical convenience that
$\zeta(\alpha_*) = 10$. Repeating our computation with $1.05 \le \alpha \le 1.15$ gives values of
$N$ that range from $1.1 \times 10^7$ to $1.4 \times 10^7$, and estimates $n'$ from 71.04 to 77.29.
This estimate is very stable because the Zipf curve is relatively straight in the
critical region.

Figure 3: This figure plots $(X_i - X_{i+1})/\sqrt{X_i + X_{i+1}}$ versus $i = 1, \dots, 100$ for
the BNC data. The horizontal reference lines are drawn at 2.5 standard errors
and at 1.0 standard errors. The vertical line is drawn at $n' - 1 = 71.08$.

The log-log plot in Figure 1 wiggles noticeably between ranks 10 and 50,
which can be attributed to $E(X_i)$ not perfectly following a local power law
there. The British National Corpus rank orderings are not quite as reliable as
those in the fitted Zipf-Poisson model. Unsurprisingly, a one parameter model
shows some lack of fit on this enormous data set.

Figure 3 plots the number of standard errors separating each of the first 100
consecutive word pairs. A horizontal line is at 2.5. The theorem predicts that the
first $n' = 72.08 \approx 72$ words would be correctly ordered relative to each other.
When the first $n'$ words are correctly ordered the first $n' - 1$ differences have the
correct sign. The vertical reference line is at 71.08. Beyond 72 it is clear that many
consecutive word orderings are doubtful. We also see a few small standard errors
for $n < 72$ which correspond to some local flat spots in Figure 1. As a result we might
expect a small number of transposition errors among the first 72 words, in
addition to the large number of rank errors that set in beyond the 72'nd word,
as predicted by the Zipf-Poisson model.



3 Proof of Theorem 1

Theorem 1 has three claims. First, equation (1), on correct ordering of the $n$ most
popular items within themselves, follows from Corollary 1 below. Combining
that corollary with Lemma 4 to rule out interlopers establishes (2), in which the
first $n$ items are correctly identified and ordered. The third claim (3), showing
the necessity of the logarithmic factor, follows from Corollary 2.



3.1 Some useful inequalities 

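The proof of Theorem 1 makes use of some bounds on Poisson probabilities and
the gamma function, collected here.

Let $Y \sim \mathrm{Poi}(\lambda)$. Shorack and Wellner (1986, page 485) have the following
exponential bounds

$$P(Y \ge t) \le \Bigl(1 - \frac{\lambda}{t+1}\Bigr)^{-1} \frac{e^{-\lambda}\lambda^{t}}{t!} \quad \text{for integers } t \ge \lambda, \text{ and} \tag{4}$$

$$P(Y \le t) \le \Bigl(1 - \frac{t}{\lambda}\Bigr)^{-1} \frac{e^{-\lambda}\lambda^{t}}{t!} \quad \text{for integers } t < \lambda. \tag{5}$$

Klar (2000) shows that (4) holds for $t \ge \lambda - 1$. Equation (4) holds for real
valued $t \ge \lambda$ and equation (5) also holds for real valued $t < \lambda$. In both cases we
interpret $t!$ as $\Gamma(t + 1)$.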
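
To get a feel for how tight the upper tail bound is, the sketch below compares it with a directly summed Poisson tail. We assume the bound has the form $P(Y \ge t) \le (1 - \lambda/(t+1))^{-1} e^{-\lambda}\lambda^t/t!$ for integers $t \ge \lambda$; the function names and the test values of $(\lambda, t)$ are ours.

```python
import math

def log_pmf(lam, k):
    """log P(Y = k) for Y ~ Poi(lam), using lgamma for the factorial."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def upper_tail(lam, t, extra=400):
    """P(Y >= t) by direct summation of `extra` terms starting at t."""
    return sum(math.exp(log_pmf(lam, k)) for k in range(t, t + extra))

def tail_bound(lam, t):
    """Assumed form of the bound: P(Y = t) / (1 - lam / (t + 1))."""
    return math.exp(log_pmf(lam, t)) / (1 - lam / (t + 1))

for lam, t in [(5.0, 10), (20.0, 30), (100.0, 130)]:
    print(f"lam={lam:6.1f} t={t:4d} "
          f"tail={upper_tail(lam, t):.3e} bound={tail_bound(lam, t):.3e}")
```

In each case the bound exceeds the summed tail by a modest factor, which is what makes it useful for the tail sums in Lemma 3.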

A classic result of Teicher (1955) is that

$$P(Y \le \lambda) > \exp(-1) \tag{6}$$

when $Y \sim \mathrm{Poi}(\lambda)$. If $Y \sim \mathrm{Poi}(\lambda)$, then

$$\sup_{-\infty < t < \infty} \Bigl| P\Bigl(\frac{Y - \lambda}{\sqrt{\lambda}} \le t\Bigr) - \Phi(t) \Bigr| \le \frac{0.8}{\sqrt{\lambda}}, \tag{7}$$

where $\Phi$ is the standard normal CDF. Equation (7) follows by specializing a
Berry-Esseen result for compound Poisson distributions (Michel, 1993, Theorem
1) to the case of a Poisson distribution.

We will also use Gautschi's (1959) inequality on the Gamma function,

$$x^{1-s} < \frac{\Gamma(x+1)}{\Gamma(x+s)} < (x+1)^{1-s}, \tag{8}$$

which holds for $x > 0$ and $0 < s < 1$.



3.2 Correct relative ordering, equation (1)

The difference of two independent Poisson random variables has a Skellam
(1946) distribution. We begin with a Chernoff bound for the Skellam
distribution.

Lemma 1. Let $Z = X - Y$ where $X \sim \mathrm{Poi}(\lambda)$ and $Y \sim \mathrm{Poi}(\nu)$ are independent
and $\lambda \ge \nu$. Then

$$P(Z \le 0) \le \exp\bigl(-(\sqrt{\lambda} - \sqrt{\nu})^2\bigr). \tag{9}$$

Proof. Let $\varphi(t) = \lambda e^{-t} + \nu e^{t}$. Then $\varphi$ is a convex function attaining its minimum
at $t^* = \log(\sqrt{\lambda/\nu}) \ge 0$, with $\varphi(t^*) = 2\sqrt{\lambda\nu}$. Using the Laplace transform of
the Poisson distribution

$$m(t) \equiv E(e^{-tZ}) = e^{\lambda(e^{-t}-1)} e^{\nu(e^{t}-1)} = e^{-(\lambda+\nu)} e^{\varphi(t)}.$$

For $t \ge 0$, Markov's inequality gives $P(Z \le 0) = P(e^{-tZ} \ge 1) \le E(e^{-tZ})$.
Taking $t = t^*$ yields (9). □
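
The Chernoff bound $P(X \le Y) \le \exp(-(\sqrt{\lambda} - \sqrt{\nu})^2)$ can be compared with a directly computed probability. The sketch below sums the joint Poisson probabilities up to a truncation point; the function names, the truncation level and the test pairs $(\lambda, \nu)$ are our assumptions.

```python
import math

def pmf(lam, k):
    """P(Y = k) for Y ~ Poi(lam)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def prob_x_le_y(lam, nu, kmax=400):
    """P(X <= Y) for independent X ~ Poi(lam), Y ~ Poi(nu), truncated at kmax."""
    total, cdf_x = 0.0, 0.0
    for k in range(kmax + 1):
        cdf_x += pmf(lam, k)            # running value of P(X <= k)
        total += pmf(nu, k) * cdf_x     # adds P(Y = k) * P(X <= k)
    return total

def chernoff(lam, nu):
    """Right hand side of the Skellam Chernoff bound."""
    return math.exp(-(math.sqrt(lam) - math.sqrt(nu)) ** 2)

for lam, nu in [(30.0, 10.0), (100.0, 64.0), (50.0, 50.0)]:
    print(f"lam={lam:5.1f} nu={nu:5.1f} "
          f"P(X<=Y)={prob_x_le_y(lam, nu):.4e} bound={chernoff(lam, nu):.4e}")
```

The bound depends on the means only through $\sqrt{\lambda} - \sqrt{\nu}$, which is why the proof of Lemma 2 works with $f(x) = x^{-\alpha/2}$.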

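Lemma 2. Let $X_i$ be sampled from the Zipf-Poisson ensemble. Then for $n \ge 2$,

$$P(X_1 > X_2 > \cdots > X_n) \ge 1 - n \exp\Bigl(-\frac{N\alpha^2}{4}\, n^{-\alpha-2}\Bigr). \tag{10}$$

Proof. By Lemma 1 and the Bonferroni inequality, the probability that $X_{i+1} \ge
X_i$ holds for any $i < n$ is no more than

$$\sum_{i=1}^{n-1} \exp\bigl(-(\sqrt{\lambda_i} - \sqrt{\lambda_{i+1}})^2\bigr) = \sum_{i=1}^{n-1} \exp\Bigl(-N\bigl(i^{-\alpha/2} - (i+1)^{-\alpha/2}\bigr)^2\Bigr). \tag{11}$$

For $x \ge 1$, let $f(x) = x^{-\alpha/2}$. Then $|\sqrt{\lambda_i} - \sqrt{\lambda_{i+1}}| = \sqrt{N}\,|f(i) - f(i+1)| = \sqrt{N}\,|f'(z)|$ for
some $z \in (i, i+1)$. Because $|f'|$ is decreasing, (11) is at most $n \exp(-N f'(n)^2)$,
establishing (10). □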

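Now we can establish the first claim in Theorem 1.

Corollary 1. Let $X_i$ be sampled from the Zipf-Poisson ensemble. Choose $n =
n(N) \ge 2$ so that $n \le (AN/\log(N))^{1/(\alpha+2)}$ holds for all large enough $N$ where
$A = \alpha^2(\alpha+2)/4$. Then

$$\lim_{N\to\infty} P(X_1 > X_2 > \cdots > X_n) = 1.$$

Proof. For large enough $N$ we let $n = (A_N N/\log(N))^{1/(\alpha+2)}$ for $A_N \le A$. Then

$$n \exp\Bigl(-\frac{N\alpha^2}{4}\, n^{-\alpha-2}\Bigr) = \Bigl(\frac{A_N N}{\log(N)}\Bigr)^{1/(\alpha+2)} N^{-\alpha^2/(4A_N)} \le \Bigl(\frac{A N}{\log(N)}\Bigr)^{1/(\alpha+2)} N^{-\alpha^2/(4\alpha^2(\alpha+2)/4)} = \Bigl(\frac{A}{\log(N)}\Bigr)^{1/(\alpha+2)} \to 0.$$

The proof then follows from Lemma 2. □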
3.3 Correct absolute ordering, equation (2)

For the second claim in Theorem 1 we need to control the probability that one
of the entities from the tail given by $i > n$ can jump over one of the first
$n$ entities. Lemma 3 bounds the probability that an entity from the tail of the
Zipf-Poisson ensemble can jump over a high level $\tau$.

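Lemma 3. Let $X_i$ for $i \ge 1$ be from the Zipf-Poisson ensemble with parameter
$\alpha > 1$. If $\tau \ge \lambda_n$ then

$$P\Bigl(\max_{i>n} X_i > \tau\Bigr) \le \frac{N^{1/\alpha}}{\alpha} \cdot \frac{\tau+1}{\tau+1-\lambda_n} \cdot \frac{\tau^{-1/\alpha}}{\tau - 1/\alpha}. \tag{12}$$

Proof. First, $P(\max_{i>n} X_i > \tau) \le \sum_{i=n+1}^{\infty} P(X_i > \tau)$ and then from (4)

$$P\Bigl(\max_{i>n} X_i > \tau\Bigr) \le \Bigl(1 - \frac{\lambda_n}{\tau+1}\Bigr)^{-1} \sum_{i=n+1}^{\infty} \frac{e^{-\lambda_i}\lambda_i^{\tau}}{\Gamma(\tau+1)}.$$

Now $\lambda_i = N i^{-\alpha}$. For $i > n$ we have $\tau > \lambda_i = N i^{-\alpha}$. Over this range, $e^{-\lambda}\lambda^{\tau}$ is
an increasing function of $\lambda$. Therefore,

$$\sum_{i=n+1}^{\infty} e^{-\lambda_i}\lambda_i^{\tau} \le \int_n^{\infty} e^{-N x^{-\alpha}} (N x^{-\alpha})^{\tau}\, dx = \frac{N^{1/\alpha}}{\alpha} \int_0^{N n^{-\alpha}} e^{-y} y^{\tau - 1/\alpha - 1}\, dy \le \frac{N^{1/\alpha}}{\alpha}\, \Gamma(\tau - 1/\alpha).$$

As a result

$$P\Bigl(\max_{i>n} X_i > \tau\Bigr) \le \frac{N^{1/\alpha}}{\alpha} \cdot \frac{\tau+1}{\tau+1-\lambda_n} \cdot \frac{\Gamma(\tau - 1/\alpha)}{\Gamma(\tau+1)}.$$

Now

$$\frac{\Gamma(\tau - 1/\alpha)}{\Gamma(\tau+1)} = \frac{\Gamma(\tau + 1 - 1/\alpha)}{\Gamma(\tau+1)} \cdot \frac{1}{\tau - 1/\alpha} \le \frac{\tau^{-1/\alpha}}{\tau - 1/\alpha}$$

by Gautschi's inequality (8), with $s = 1 - 1/\alpha$, establishing (12). □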

For an incorrect ordering to arise, either an entity from the tail exceeds a
high level, or an entity from among the first $n$ is unusually low. Lemma 4
uses a threshold for which both such events are unlikely, establishing the second
claim (2) of Theorem 1.

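Lemma 4. Let $X_i$ for $i \ge 1$ be from the Zipf-Poisson ensemble with parameter
$\alpha > 1$. Let $n(N)$ satisfy $n \ge (AN/\log(N))^{1/(\alpha+2)}$ for $0 < A < A(\alpha) =
\alpha^2(\alpha+2)/4$. Let $m \le (BN/\log(N))^{1/(\alpha+2)}$ for $0 < B < A$. Then

$$\lim_{N\to\infty} P\Bigl(\max_{i>n} X_i \ge X_m\Bigr) = 0. \tag{13}$$

Proof. For any threshold $\tau$,

$$P\Bigl(\max_{i>n} X_i \ge X_m\Bigr) \le P\Bigl(\max_{i>n} X_i > \tau\Bigr) + P(X_m \le \tau). \tag{14}$$

The threshold we choose is $\tau = \sqrt{\lambda_m \lambda_n}$ where $\lambda_i = E(X_i) = N i^{-\alpha}$.

Write $n = (A_N N/\log(N))^{1/(\alpha+2)}$ and $m = (B_N N/\log(N))^{1/(\alpha+2)}$ for $0 <
B_N \le B < A_N \le A < A(\alpha)$. Then $\tau = \sqrt{\lambda_m \lambda_n} = N (C_N N/\log(N))^{-\alpha/(\alpha+2)}$
where $C_N = \sqrt{A_N B_N}$. Therefore

$$\tau = \Theta\bigl(N^{2/(\alpha+2)} (\log(N))^{\alpha/(\alpha+2)}\bigr).$$

By construction, $\tau \ge \lambda_n$ and so by Lemma 3

$$P\Bigl(\max_{i>n} X_i > \tau\Bigr) \le \frac{N^{1/\alpha}}{\alpha} \cdot \frac{\tau+1}{\tau+1-\lambda_n} \cdot \frac{\tau^{-1/\alpha}}{\tau - 1/\alpha}.$$

Because $\lambda_n/\tau = (B_N/A_N)^{\alpha/(2\alpha+4)}$, we have $(\tau+1)/(\tau+1-\lambda_n) = O(1)$.
Therefore

$$P\Bigl(\max_{i>n} X_i > \tau\Bigr) = O\bigl(N^{1/\alpha}\tau^{-1/\alpha-1}\bigr) = O\bigl(N^{-1/(\alpha+2)} (\log(N))^{-(\alpha+1)/(\alpha+2)}\bigr)$$

and so the first term in (14) tends to 0 as $N \to \infty$.

For the second term in (14), notice that $X_m$ has mean $\lambda_m > \tau$ and standard
deviation $\sqrt{\lambda_m}$. Letting $\rho = \alpha/(\alpha+2)$ and applying Chebychev's inequality, we
find that

$$P(X_m \le \tau) \le \frac{\lambda_m}{(\tau - \lambda_m)^2} = \frac{B_N^{-\rho} N^{2/(\alpha+2)} (\log(N))^{\rho}}{(B_N^{-\rho} - C_N^{-\rho})^2 N^{4/(\alpha+2)} (\log(N))^{2\rho}} \le \frac{N^{-2/(\alpha+2)}}{(A^{\rho/2} - B^{\rho/2})^2} \Bigl(\frac{\log(N)}{AB}\Bigr)^{-\rho} \to 0$$

as $N \to \infty$. □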

Lemma 4 is sharp enough for our purposes. A somewhat longer argument
in Dyer (2010) shows that the interloper phenomenon is ruled out even deeper
into the tail of the Zipf-Poisson ensemble. Specifically, if $m \le (BN)^{\beta}$ and
$n \ge (AN)^{\beta}$ for $0 < B < A$ and $\beta < 1/\alpha$, then (13) still holds.

3.4 Limit to correct ordering, equation (3)

While we can get $(AN/\log(N))^{1/(\alpha+2)}$ entities properly ordered, there is a limit
to the number of correctly ordered entities. We cannot get above $CN^{1/(\alpha+2)}$
correctly ordered entities, asymptotically. That is, the logarithm cannot be
removed. We begin with a lower bound on the probability of a wrong ordering
for two consecutive entities.

Lemma 5. Let $X_i$ be from the Zipf-Poisson ensemble with $\alpha > 1$. Suppose that
$AN^{1/(\alpha+2)} \le i < i+1 \le BN^{1/(\alpha+2)}$ where $0 < A < B < \infty$. Then for large
enough $N$,

$$P(X_{i+1} > X_i) \ge \frac{1}{3}\, \Phi\Bigl(-\frac{\alpha A^{\alpha/2}}{B^{\alpha+1}}\Bigr).$$

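Proof. First $P(X_{i+1} > X_i) \ge P(X_{i+1} > \lambda_i) P(X_i \le \lambda_i) \ge P(X_{i+1} > \lambda_i)/e$ using
Teicher's inequality (6). Next, by (7),

$$P(X_{i+1} > \lambda_i) = 1 - P(X_{i+1} \le \lambda_i) \ge \Phi\Bigl(\frac{\lambda_{i+1} - \lambda_i}{\sqrt{\lambda_{i+1}}}\Bigr) - \frac{0.8}{\sqrt{\lambda_{i+1}}}.$$

Now,

$$\frac{\lambda_{i+1} - \lambda_i}{\sqrt{\lambda_{i+1}}} = -\alpha \sqrt{N}\, (i+\eta)^{-\alpha-1} (i+1)^{\alpha/2}$$

for some $\eta \in (0, 1)$. Applying the bounds on $i$,

$$\frac{\lambda_{i+1} - \lambda_i}{\sqrt{\lambda_{i+1}}} \ge -\alpha A^{\alpha/2} B^{-\alpha-1}.$$

Finally, letting $N \to \infty$ we have $\lambda_{i+1} \to \infty$ and so $0.8/\sqrt{\lambda_{i+1}}$ is eventually
smaller than $(1 - e/3)\Phi(-\alpha A^{\alpha/2} B^{-\alpha-1})$. Letting $\theta = -\alpha A^{\alpha/2} B^{-\alpha-1}$ we have,
for large enough $N$,

$$P(X_{i+1} > X_i) \ge \bigl(\Phi(\theta) - (1 - e/3)\Phi(\theta)\bigr)\frac{1}{e} = \frac{\Phi(\theta)}{3}. \qquad □$$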

To complete the proof of Theorem 1 we establish equation (3). For $n$ beyond
a multiple of $N^{1/(\alpha+2)}$, the reverse orderings predicted by Lemma 5 cannot be
avoided.

Corollary 2. Let $X_i$ be sampled from the Zipf-Poisson ensemble. Suppose that
$n = n(N)$ satisfies $n \ge CN^{1/(\alpha+2)}$ for $0 < C < \infty$. Then

$$\lim_{N\to\infty} P(X_1 > X_2 > \cdots > X_n) = 0.$$

Proof. Let $p \in (0, 1)$ be a constant such that $P(X_{i+1} > X_i) \ge p$ holds for all
large enough $N$ and $(C/2)N^{1/(\alpha+2)} \le i < i+1 \le CN^{1/(\alpha+2)}$. For instance
Lemma 5 shows that $p = \Phi(-\alpha (C/2)^{\alpha/2}/C^{\alpha+1})/3$ is such a
constant. Then

$$P(X_1 > X_2 > \cdots > X_n) \le \prod_i{}^{*}\, P(X_i > X_{i+1}) \tag{15}$$

holds where $\prod^*$ is over all odd integers $i \in [(C/2)N^{1/(\alpha+2)}, CN^{1/(\alpha+2)})$. There
are roughly $CN^{1/(\alpha+2)}/4$ odd integers in the product. For large enough $N$, the
right side of (15) is below $(1 - p)^{CN^{1/(\alpha+2)}/5} \to 0$. □
4 Discussion



We have found that the top few entities in a Zipf plot of counts can be expected 
to be in the correct order, when their frequencies are measured with Poisson 
errors. Even in the idealized Zipf setting, the number of correctly ordered 
entities grows fairly slowly with N. 

Our transition point is at $n' = (\alpha^2(\alpha+2)N/(4\log(N)))^{1/(\alpha+2)}$ and estimating
$N$ from $T = \sum_i X_i$ leads to the estimate

$$\Bigl(\frac{\alpha^2(\alpha+2)\, T/\zeta(\alpha)}{4\log(T/\zeta(\alpha))}\Bigr)^{1/(\alpha+2)}.$$

The threshold $n'$ uses some slightly conservative estimates to get a rate in
$N$. For the Zipf-Poisson ensemble with $N = 10^7$ and $\alpha = 1.106$ we can use (10)
of Lemma 2 directly to find

$$1 - P(X_1 > X_2 > \cdots > X_{72}) \le \sum_{i=1}^{71} \exp\Bigl(-N\bigl(i^{-\alpha/2} - (i+1)^{-\alpha/2}\bigr)^2\Bigr) = 0.0199.$$

We get a bound of 1% by taking $n = 70$ and a bound of 5% by taking $n = 76$.
The formula for $n'$ comes remarkably close to what we get working directly with
equation (10).
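
The Bonferroni sum from the proof of Lemma 2 is easy to evaluate numerically. The sketch below does so for $N = 10^7$ and $\alpha = 1.106$; the function name `bonferroni_bound` is ours.

```python
import math

def bonferroni_bound(n, N, alpha):
    """Sum over i = 1..n-1 of exp(-N (i^(-alpha/2) - (i+1)^(-alpha/2))^2)."""
    return sum(
        math.exp(-N * (i ** (-alpha / 2) - (i + 1) ** (-alpha / 2)) ** 2)
        for i in range(1, n)
    )

# The text reports 0.0199 for n = 72; nearby values of n show the sensitivity.
for n in (70, 72, 76):
    print(n, round(bonferroni_bound(n, 1e7, 1.106), 4))
```

The sum is dominated by its last few terms, so the bound rises quickly once $n$ passes the threshold.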



The Skellam bounds do not assume a Zipf rate for the Poisson means.
Therefore we can use them to generalize the computation above. For example, with
a Zipf-Mandelbrot-Poisson ensemble having $X_i \sim \mathrm{Poi}(N(i+k)^{-\alpha})$ we can still
apply equation (10) to show that the probability of an error among the first $n$
ranks is at most

$$p(n; N, \alpha, k) = \sum_{i=1}^{n-1} \exp\Bigl(-N\bigl((i+k)^{-\alpha/2} - (i+k+1)^{-\alpha/2}\bigr)^2\Bigr). \tag{16}$$

A conservative estimate of the number of correct positions in the
Zipf-Mandelbrot-Poisson ensemble is

$$n' = \max\{n \ge 1 \mid p(n; N, \alpha, k) \le 0.01\} \tag{17}$$

with $n' = 0$ if $p(1; N, \alpha, k) > 0.01$. We can estimate $N$ by $T/\zeta(\alpha, k+1)$ where
$T = \sum_i X_i$ and $\zeta(\alpha, h) = \sum_{i=0}^{\infty} (i+h)^{-\alpha}$ is the Hurwitz zeta function.

Equation (17) is conservative because it stems from the Bonferroni
inequality, and does not adjust for two or more order relations being violated. It will
be less conservative for small target probabilities like 0.01 than for large ones
where adjustments are relatively more important.
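
The numerical method for the Zipf-Mandelbrot-Poisson ensemble can be sketched as follows. The code evaluates the Bonferroni sum $p(n; N, \alpha, k)$ and searches for the largest $n$ at or below a target probability. Function names, the search cap `n_max` and the parameter values in the usage lines are our assumptions. In this sketch $p(1; N, \alpha, k)$ is an empty sum, so the search always returns at least 1.

```python
import math

def term(i, N, alpha, k):
    """Bound on P(X_{i+1} >= X_i) when X_i ~ Poi(N (i + k)^(-alpha))."""
    gap = (i + k) ** (-alpha / 2) - (i + k + 1) ** (-alpha / 2)
    return math.exp(-N * gap ** 2)

def p_error(n, N, alpha, k):
    """Bonferroni bound on an ordering error among the first n ranks."""
    return sum(term(i, N, alpha, k) for i in range(1, n))

def n_conservative(N, alpha, k, target=0.01, n_max=100_000):
    """Largest n (up to n_max) whose bound p_error stays at or below target."""
    total, n = 0.0, 1
    while n < n_max:
        nxt = total + term(n, N, alpha, k)   # equals p_error(n + 1, ...)
        if nxt > target:
            break
        total, n = nxt, n + 1
    return n

for k in (0, 2, 5):
    print(f"k = {k}: conservative n' = {n_conservative(1e7, 1.106, k)}")
```

Setting $k = 0$ recovers the pure Zipf case, and increasing $k$ with $N$ held fixed flattens the head of the distribution, which lowers the number of ranks that can be trusted.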

Our focus is on the ranks that are correctly estimated. Methods to estimate
parameters of the Zipf distribution or Zipf-Mandelbrot distribution typically
use values of $X_i$ for $i$ much larger than the number of correctly identified items.
It is not unreasonable to do so, because ordering errors tend to distribute the
values of $X_i$ both above and below the parametric trend line.

A small number of correct unique words can correspond to a reasonably large
fraction of word usage. The BNC is roughly 6.2% 'the' and the top 72 words
comprise about 45.3% of the corpus.

For large $N$, the top $n_\epsilon = N^{1/(\alpha+2) - \epsilon}$ entities get properly ordered with very
high probability for $0 < \epsilon < 1/(\alpha+2)$. The tail beyond $n_\epsilon$ accounts for a
proportion of data close to $\zeta(\alpha)^{-1} \int_{n_\epsilon}^{\infty} x^{-\alpha}\, dx = O(n_\epsilon^{-\alpha+1}) = O(N^{(1-\alpha)/(\alpha+2) + \epsilon'})$
for $\epsilon' = \epsilon(\alpha - 1)$. Taking small $\epsilon$ and recalling that $\alpha > 1$ we find that the
fraction of data from improperly ordered entities vanishes in the Zipf-Poisson
ensemble. When $\alpha$ is just barely larger than 1 the rate may be slow.